How to prepare a protein database

Alexey Nesvizhskii edited this page May 9, 2023 · 4 revisions

Proteomics pipelines and toolkits like Philosopher rely on properly formatted protein sequence databases to correctly identify peptides. Here are some tips on how to prepare a protein database for your experiment.

If you do not already have a protein sequence database: --id

Run Philosopher from the command line to download one from UniProt by executing the following two commands:

philosopher workspace --init
philosopher database --reviewed --contam --id UP000005640

This will generate a human UniProt/SwissProt database (i.e. reviewed sequences only), with common contaminants and decoys added (default decoy prefix: rev_). If you would like to use the full UniProt proteome, including unreviewed sequences, remove the --reviewed flag.

For mouse, for example, use the proteome ID UP000000589. To find the proteome ID for other organisms, search within the UniProt proteomes.

To combine multiple proteomes, provide a comma-separated list, e.g.:

philosopher workspace --init
philosopher database --reviewed --contam --id UP000005640,UP000000625,UP000002311

This generates a single database containing the human, E. coli, and yeast proteomes.

If you have your own database without decoys and contaminants: --custom

Add decoys and contaminants, and format the database for FragPipe/Philosopher, using the following commands:

philosopher workspace --init 
philosopher database --custom <file_name> --contam 

If you have your own database with decoys and contaminants: --annotate

Reformat it for FragPipe using the following commands:

philosopher workspace --init 
philosopher database --annotate <file_name> --prefix <prefix>
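
Before running --annotate, it can be useful to confirm that the value you plan to pass to --prefix really appears at the start of the decoy headers in your file. The following is a minimal shell sketch, not a Philosopher command; the function name count_prefixed_headers and the file name my_db.fasta are hypothetical.

```shell
# Hypothetical helper (not part of Philosopher): report how many FASTA
# headers start with the given decoy prefix, out of all headers in the file.
count_prefixed_headers() {
    fasta="$1"
    prefix="${2:-rev_}"   # assumed default; use the prefix present in your file
    awk -v p=">${prefix}" '
        # index(...) == 1 means the header line begins with ">" + prefix
        /^>/ { total++; if (index($0, p) == 1) decoys++ }
        END  { printf "%d decoy headers out of %d\n", decoys + 0, total + 0 }
    ' "$fasta"
}

# Example call (hypothetical file name):
#   count_prefixed_headers my_db.fasta DECOY_
```

If the reported decoy count is zero, the prefix most likely does not match the one used in the file.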

Header formatting

If you need to run the --custom or the --annotate command, you may want to manually inspect the formatted file to ensure it is compatible with Philosopher. Each header should follow one of these formats (an example is given for each):

  • UniProt: >sp|P02489|CRYAA_HUMAN Alpha-crystallin A chain OS=Homo sapiens OX=9606 GN=CRYAA PE=1 SV=2

  • NCBI: >NP_000385.1 alpha-crystallin A chain isoform 1 [Homo sapiens]

  • ENSEMBL: >ENSP00000291554.2 pep chromosome:GRCh38:21:43169008:43172805:1 gene:ENSG00000160202.7 transcript:ENST00000291554.6 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CRYAA description:crystallin alpha A [Source:HGNC Symbol;Acc:HGNC:2388]

  • or generic: >P02489

Note: the protein description text (e.g. "crystallin alpha A") should not contain any commas or special characters, as these may cause Philosopher to parse the entry incorrectly.
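
The comma restriction in the note on description text can be checked before running Philosopher. The sketch below is a simple illustration, not a Philosopher feature: the function name list_headers_with_commas is hypothetical, it flags a comma anywhere in a header line (not only in the description), and it does not test for other special characters.

```shell
# Hypothetical helper: print the line number and text of every FASTA header
# that contains a comma, so the entry can be fixed by hand.
list_headers_with_commas() {
    awk '/^>/ && /,/ { print NR ": " $0 }' "$1"
}

# Example call (hypothetical file name):
#   list_headers_with_commas my_db.fasta
```

No output means no header in the file contains a comma.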

If you are adding your own decoys, they also need to follow a specific format: each decoy must be written as a whole protein entry in the FASTA file, with a decoy prefix (e.g. rev_ or DECOY_) added at the very beginning of the header.

Examples of compatible decoy formats:

  • >rev_tr|J3KNE0|J3KNE0_HUMAN
  • >DECOY_tr|J3KNE0|J3KNE0_HUMAN

Examples of incompatible decoy formats:

  • >tr_REVERSED|J3KNE0|J3KNE0_HUMAN
  • >tr|fake_J3KNE0|J3KNE0_HUMAN RanBP2-like
  • >tr|J3KNE0_DECOY|J3KNE0_HUMAN
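
As an illustration of the compatible layout, the sketch below builds decoy entries by reversing each sequence and placing the prefix at the very start of the header. It only demonstrates the header format described above; it is not Philosopher's own decoy generation, and the function name make_reversed_decoys and the file names in the usage example are hypothetical.

```shell
# Hypothetical helper: write one decoy entry per input entry, with the
# prefix at the very beginning of the header and the sequence reversed.
make_reversed_decoys() {
    fasta="$1"
    prefix="${2:-rev_}"   # assumed default prefix
    awk -v prefix="$prefix" '
        # Print the pending entry as a decoy: prefixed header, reversed sequence.
        function emit(    i, rev) {
            if (header == "") return
            rev = ""
            for (i = length(seq); i >= 1; i--) rev = rev substr(seq, i, 1)
            print ">" prefix substr(header, 2)
            print rev
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # a new entry starts
             { seq = seq $0 }                          # join wrapped sequence lines
        END  { emit() }
    ' "$fasta"
}

# Example usage (hypothetical file names):
#   make_reversed_decoys targets.fasta rev_ > decoys.fasta
#   cat targets.fasta decoys.fasta > targets_plus_decoys.fasta
```

Each decoy sequence is written on a single line, and the rest of the original header is kept unchanged after the prefix.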